In this article, we propose an algorithm that predicts missing or future
anchor links between users of two different online social networks. Our
algorithm uses the graph attention technique to represent the source and
target networks in low-dimensional embedding spaces; we then apply
canonical correlation analysis to align their embeddings in the same latent
space for the final prediction.
1. INTRODUCTION

Recently, with the growing diversity of online social networks, a person can take part in several
online social networks for many different purposes. Most users participate in these networks with the
same or similar properties, such as full name, username, and gender. However, in some cases, users do
not disclose personal information consistently across different social networks. Anchor links predicted
from such noisy information are therefore not accurate. Thus, anchor link prediction, the task of
matching users across social networks, remains a major challenge and continues to attract a lot of
attention from the scientific community.

There are many different methods for the anchor link prediction problem. The initial studies handled
it by exploiting self-defined user profiles and user-generated content to measure similarity and obtain
the prediction result. These traditional methods attempt to align users across online social networks
using self-defined profile attributes such as name, gender, age, and location [1]-[3], and
user-generated content such as tweets, posts, and publications [4], [5]. Methods that follow this
approach usually rely on heuristics to process text data and compare similarity. They are sensitive to
the similarity metrics and to the self-defined user information, and thus come with limitations due to
the imbalance of users’ demographic data across information networks and privacy issues in retrieving
user profile information.
Recently, the rapid development of network embedding techniques [6]-[8] has opened a new
research trend in the field of anchor link prediction. Network embedding is the task of learning a
representation of each node in the network that has a lower dimension than the original one but still
preserves the network properties and structure. Following this strategy, some methods [9]-[13] attempt
to find the low-dimensional embedding of every node in the source and target online social networks
using a graph neural network [14], a generalization of convolutional neural networks that processes
data represented by a graph structure. In recent years, graph neural networks (GNN) have been
considered a powerful and pragmatic technique for any problem that can be represented by graphs.
Therefore, many variant models have been developed based on GNN, such as recurrent GNN [15],
convolutional GNN [16], [17], graph auto-encoders [18], [19], and spatial-temporal GNN [20]. The graph
attention network [21] is one of the modern models widely applied in fields such as link prediction
[22]-[24], node classification [25], node clustering [26], recommendation systems [27], [28], and
information diffusion [29]. In this paper, we apply this technique to the anchor link prediction problem.

MAUIL [30] is an anchor link prediction method that combines multiple embedding techniques to
increase the accuracy of the anchor link prediction model. It uses three levels of attribute embedding
to preserve the node attributes of the network and uses the LINE method [31] to embed the network
structure. This method has been demonstrated experimentally to give better results than other
solutions. However, it also exposes some limitations in terms of performance as well as complexity.
Inspired by MAUIL, we propose an anchor link prediction method based on the graph attention mechanism
to increase the accuracy and performance of this model. The main contributions of our paper are as
follows:

— We propose a combined anchor link prediction method that improves the accuracy of MAUIL by
replacing the network embedding method used in MAUIL with a graph attention mechanism to find the
embeddings of the source and target networks.

— We apply canonical correlation analysis to project these representations onto the same latent
embedding space and compute the alignment matrix of nodes between the source and target networks.
Experiments on real-life datasets show that our proposed method outperforms the original one.


2. PROPOSED METHOD 
Our proposed model is a combination of three modules: multilevel attribute embedding, graph
attention network, and regularized canonical correlation analysis (RCCA)-based correlation analysis.
Consequently, a total of four embedding matrices (three for attribute-based embedding and one for graph
attention network (GAT)-based embedding) are integrated to establish the final embedding of each social
network. As shown in Figure 1, our model proceeds in three steps:

— First, we feed the users’ attributes of the source and target networks into the three-level
embedding techniques to compute node embeddings that preserve the attribute information of each user
in the network.

— Then, we use the attribute embedding as the initial feature vector, along with the network
structure, as the input of the graph attention mechanism. This process uses the contribution of
neighbor nodes in the aggregation step to construct the final embedding.

— Finally, we apply canonical correlation analysis to project the source and target embeddings into
the same latent representation. Then, we compute the final network alignment matrix based on the
similarity scores of embedding vectors between the source and target networks.
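
The last step above can be illustrated with a short sketch. It assumes cosine similarity as the similarity score and stores one user per row; the text does not fix either choice, and the names `alignment_matrix` and `top_k_candidates` as well as the toy vectors are hypothetical.

```python
import numpy as np

def alignment_matrix(z_src, z_tgt):
    """Cosine similarity between every source row and every target row."""
    src = z_src / np.linalg.norm(z_src, axis=1, keepdims=True)
    tgt = z_tgt / np.linalg.norm(z_tgt, axis=1, keepdims=True)
    return src @ tgt.T

def top_k_candidates(sim, k):
    """Indices of the k most similar target users for each source user."""
    return np.argsort(-sim, axis=1)[:, :k]

# Toy embeddings already projected into a common 2-dimensional space.
z_src = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z_tgt = np.array([[0.9, 0.1], [0.7, 0.8], [0.1, 1.0], [-1.0, 0.0]])

sim = alignment_matrix(z_src, z_tgt)        # shape: (3 source users, 4 target users)
candidates = top_k_candidates(sim, k=2)     # best two target candidates per source user
```

Each row of `candidates` is the ranked candidate list that a top-k evaluation metric would consume.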


Figure 1. Framework overview for anchor link prediction

2.1. Multi-level attribute embedding techniques
2.1.1. Character-level attribute embedding 

This embedding technique aims to learn the representation of a node from the textual similarity of
usernames. A user’s name may contain multiple types of characters, such as letters, digits, spaces,
and special symbols; we treat each of them as a token. For each username, we count the frequency of
each unique token to get a list of tuples $t = [(t_1: f_1), (t_2: f_2), (t_3: f_3), \ldots, (t_m: f_m)]$,
where $t_i$ is the $i$-th token that appears in the username and $f_i$ is the number of occurrences of
token $t_i$ in the username. During the counting process, we also build a token dictionary that
contains all unique tokens appearing in the usernames of all users. Finally, the corresponding
count-weighted vector for user $v_i$ is $x_i^c = [f_{i1}, f_{i2}, \ldots, f_{ip}] \in \mathbb{R}^{p}$,
where $f_{ij}$ is the frequency with which the $j$-th token of the token dictionary occurs in the
username of user $v_i$, and $p$ is the total number of unique tokens in the dictionary.

The embedding matrix for all nodes in each network,
$X^c = [x_1^c, x_2^c, \ldots, x_n^c] \in \mathbb{R}^{n \times p}$, is reduced by applying an
auto-encoder, an essential and powerful technique for reducing the dimensionality of data. This work
implements a one-layer auto-encoder to map the token frequency vectors into a distributed embedding
space. After this step, we acquire the character-level feature matrix of each network,
$P_c = [p_1^c, p_2^c, \ldots, p_n^c] \in \mathbb{R}^{n \times d}$, where $\mathbb{R}^{n \times p}$ is
the original high-dimensional space and $\mathbb{R}^{n \times d}$ is the reduced-dimensional space of
the character-level embedding.
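
The counting step can be sketched as follows, assuming each token is a single character; the auto-encoder reduction is not shown. The helper names and the two usernames are made up for illustration.

```python
from collections import Counter

def build_token_dictionary(usernames):
    """Map every unique character seen in any username to a column index."""
    tokens = sorted({ch for name in usernames for ch in name})
    return {tok: idx for idx, tok in enumerate(tokens)}

def count_vector(username, token_index):
    """Frequency of each dictionary token within one username."""
    counts = Counter(username)
    vec = [0] * len(token_index)
    for tok, freq in counts.items():
        vec[token_index[tok]] = freq
    return vec

usernames = ["anna_01", "ana.1"]            # hypothetical usernames
token_index = build_token_dictionary(usernames)
# One count-weighted row per user; p = len(token_index) columns.
X_c = [count_vector(name, token_index) for name in usernames]
```

With this toy input the dictionary holds six tokens, so each user is described by a 6-dimensional count vector before any dimensionality reduction.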


2.1.2. Word-level attribute embedding 

This embedding technique aims to learn the representation of a node from the textual similarity of
word groups or short sentences, such as affiliations, social relationships, and working experience. We
apply the Word2vec model [32], one of the most popular techniques for converting text into feature
vectors, to embed the characteristics of a user at word level. The word-level attribute of user $v_i$
in a network $G$ with $n$ users is denoted by $a_i^w$ and may contain only part of a word. We represent
the words in these short sentences as a sequence of $m$ unique words $w = w_1, w_2, w_3, \ldots, w_m$.
We use them as the input corpus to build the vocabulary dictionary, the target words, and the list of
contextual words for each target word using a window size. Then, they are fed into the continuous bag
of words (CBOW) model [32], a neural network model that predicts the target word from the context of
the surrounding words.
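
To make the role of the window size concrete, the sketch below builds the (context words, target word) pairs that a CBOW model is trained on. It is only an illustration of the data preparation, not of Word2vec training; `cbow_pairs` and the example sentence are hypothetical.

```python
def cbow_pairs(words, window):
    """Return (context_words, target_word) pairs for one tokenized attribute."""
    pairs = []
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]          # up to `window` words before
        right = words[i + 1:i + 1 + window]         # up to `window` words after
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs

sentence = ["graph", "neural", "network", "lab"]    # hypothetical short attribute
pairs = cbow_pairs(sentence, window=1)
```

A larger window gives each target word more context at the cost of mixing in less closely related words.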


To clearly distinguish between word-level and character-level embedding, we enhance the word-level
embedding of a node by adding the embedding information of its neighboring nodes. Thus, we use (1) to
regularize the embedding of each user with a real-valued parameter $\lambda \in [0, 1]$ that weights
the contribution of the neighbor embeddings:

$$p_i^w = (1 - \lambda) z_i + \lambda \frac{1}{|N_i|} \sum_{j \in N_i} z_j \qquad (1)$$

where $z_i$ is the word-level embedding vector of the $i$-th node produced by the Word2vec model and
$N_i$ is the set of neighbors of the $i$-th node.
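
A small numerical sketch of (1) is given below. It assumes that a node without neighbors keeps its own embedding, a case the text does not cover; the function name and toy data are illustrative.

```python
import numpy as np

def smooth_with_neighbors(Z, neighbors, lam):
    """Apply (1 - lam) * z_i + lam * mean of neighbor embeddings to every node."""
    P = np.empty_like(Z, dtype=float)
    for i, z_i in enumerate(Z):
        nbrs = neighbors.get(i, [])
        if nbrs:
            P[i] = (1.0 - lam) * z_i + lam * Z[nbrs].mean(axis=0)
        else:
            # Assumption: an isolated node keeps its own embedding unchanged.
            P[i] = z_i
    return P

Z = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # toy word-level embeddings
neighbors = {0: [1, 2], 1: [0], 2: []}               # toy adjacency lists
P = smooth_with_neighbors(Z, neighbors, lam=0.5)
```

Setting the parameter to 0 leaves the Word2vec embeddings untouched, while a value of 1 replaces each embedding with the mean of its neighbors.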


2.1.3. Topic-level attribute embedding 

For topic-level embedding, we use latent Dirichlet allocation (LDA) [33], a popular topic modeling
technique, to extract topics from a given corpus. This embedding technique aims to learn the
representation of a node from the textual similarity of user attributes expressed as long sentences or
paragraphs, such as descriptions of books, projects, and published articles. All of this information is
merged to create the corpus used as input for this embedding technique. We treat a user’s topic-level
attributes as a document that may contain many words. First, we clean, preprocess, and tokenize the
text corpus into words. Then, we build a document-word matrix $\in \mathbb{R}^{|D| \times |W|}$, where
$|D|$ is the number of documents/users and $|W|$ is the number of distinct words in the word-level
dictionary. LDA converts this document-word matrix into two other matrices: a document-topic matrix
and a topic-word matrix.

The goal of this process is to find the optimal document-topic matrix
$\in \mathbb{R}^{|D| \times |T|}$ and topic-word matrix $\in \mathbb{R}^{|T| \times |W|}$ through an
iterative process, where $|T|$ is the number of topics. At the first iteration, it randomly assigns
topics to the words in each document to generate the initial document-topic and topic-word matrices.
Then, LDA iterates over each document $D_i$ and each word $W_j$ in the document to update the topic of
that word, under the assumption that all topic assignments are correct except that of the current word.
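
The iterative update described above resembles collapsed Gibbs sampling, although the text does not name the inference algorithm. The following is a minimal sketch under that assumption; the Dirichlet hyperparameters `alpha` and `beta`, the function name `lda_gibbs`, and the toy corpus of word ids are all illustrative choices.

```python
import numpy as np

def lda_gibbs(docs, n_topics, n_words, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler; docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))    # counts per document and topic
    topic_word = np.zeros((n_topics, n_words))     # counts per topic and word
    topic_total = np.zeros(n_topics)
    # Random initial topic for every word occurrence.
    assign = [[int(rng.integers(n_topics)) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for pos, w in enumerate(doc):
            t = assign[d][pos]
            doc_topic[d, t] += 1
            topic_word[t, w] += 1
            topic_total[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for pos, w in enumerate(doc):
                t = assign[d][pos]
                # Remove the current word; all other assignments stay fixed.
                doc_topic[d, t] -= 1
                topic_word[t, w] -= 1
                topic_total[t] -= 1
                weights = ((doc_topic[d] + alpha) * (topic_word[:, w] + beta)
                           / (topic_total + n_words * beta))
                t = int(rng.choice(n_topics, p=weights / weights.sum()))
                assign[d][pos] = t
                doc_topic[d, t] += 1
                topic_word[t, w] += 1
                topic_total[t] += 1
    # Normalize smoothed counts into the two output matrices.
    theta = (doc_topic + alpha) / (doc_topic + alpha).sum(axis=1, keepdims=True)
    phi = (topic_word + beta) / (topic_word + beta).sum(axis=1, keepdims=True)
    return theta, phi

docs = [[0, 1, 0, 1], [2, 3, 3, 2], [0, 1, 2, 3]]   # toy corpus over 4 word ids
theta, phi = lda_gibbs(docs, n_topics=2, n_words=4)
```

Here `theta` plays the role of the document-topic matrix and `phi` the topic-word matrix; each row of `theta` could serve as a user's topic-level embedding.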


2.2. Graph attention network 
We apply the graph attention mechanism [21], an efficient graph neural network, to find the
low-dimensional embedding that maximizes the preservation of network attributes and local structure.
The GAT model assigns different attention coefficients to the neighbors of a node at the aggregation
step to increase the performance of prediction tasks. We also employ a multi-head attention mechanism
to reduce noise and make the prediction model more stable.


First, we use the embedding vector generated by the multilevel attribute embedding technique
(described in subsection 2.1) as the initial feature vector for the GAT model. We denote $x_i$ as the
feature vector of a user node $v_i$ in network $G$, and $x_j$ as the initial feature vector of a user
node $v_j \in N_i$, one of its neighbors. We use (2) to compute the importance score $l_{ij}$ between
user node $v_i$ and each of its neighbors $v_j$, $1 \le j \le |N_i|$:

$$l_{ij} = a^{\top} [W x_i \,\|\, W x_j] \qquad (2)$$

where $W \in \mathbb{R}^{D' \times D}$ is the weight matrix and $a \in \mathbb{R}^{2D'}$ is the weight
vector, both of which are model parameters, $D$ is the original dimension of the initial feature
vector, $D'$ is the dimension of the hidden layer, and $\|$ denotes concatenation.


These importance scores are then normalized using the softmax function so that they are comparable
across different neighborhoods. We use (3) to compute the normalized attention coefficient
$\alpha_{ij}$ between node $v_i$ and each of its neighbors $v_j$:

$$\alpha_{ij} = \frac{\exp(\mathrm{LeakyReLU}(l_{ij}))}{\sum_{t \in N_i} \exp(\mathrm{LeakyReLU}(l_{it}))} \qquad (3)$$

where LeakyReLU is an activation function based on ReLU that keeps a small nonzero slope for negative
inputs.

Finally, we compute the embedding vector of user node $v_i$ in graph $G$ using (4), which implements
multi-head attention to stabilize the self-attention learning process:

$$x_i' = \sigma\left( \frac{1}{K} \sum_{k=1}^{K} \sum_{j \in N_i} \alpha_{ij}^{k} W^{k} x_j \right) \qquad (4)$$

where $K$ is the number of attention mechanisms (heads), $k$ indexes the $k$-th attention mechanism,
and $\sigma$ is a nonlinear activation function.
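
The computation in (2)-(4) can be sketched for a single node as follows. This is an untrained forward pass with random parameters, assuming the averaging form of multi-head attention, a LeakyReLU slope of 0.2, and tanh as the output nonlinearity; none of these choices is stated in the text, and all function names are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def attention_coefficients(X, neighbors_i, i, W, a):
    """Scores from (2), normalized over the neighbors of node i as in (3)."""
    h_i = W @ X[i]
    scores = np.array([a @ np.concatenate([h_i, W @ X[j]]) for j in neighbors_i])
    e = np.exp(leaky_relu(scores))
    return e / e.sum()

def gat_embedding(X, neighbors_i, i, heads):
    """Averaging variant of (4): mean over K heads, then a nonlinearity."""
    total = 0.0
    for W, a in heads:
        alpha = attention_coefficients(X, neighbors_i, i, W, a)
        total = total + sum(al * (W @ X[j]) for al, j in zip(alpha, neighbors_i))
    return np.tanh(total / len(heads))   # tanh stands in for sigma (assumption)

rng = np.random.default_rng(0)
D, D_hidden, K = 4, 3, 2
X = rng.normal(size=(5, D))              # toy initial features, one row per user
heads = [(rng.normal(size=(D_hidden, D)), rng.normal(size=2 * D_hidden))
         for _ in range(K)]              # random (W, a) pair for each head
neighbors_0 = [1, 2, 4]                  # toy neighborhood of node 0

alpha = attention_coefficients(X, neighbors_0, 0, *heads[0])
x0_new = gat_embedding(X, neighbors_0, 0, heads)
```

In a trained model the pairs `(W, a)` would be learned; the sketch only shows how neighbor contributions are weighted and combined.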


2.3. Canonical correlation analysis 

After representing the attribute level and the structure level of each network in the above steps, we
obtain the embedding matrices $X \in \mathbb{R}^{d \times n}$ and $Y \in \mathbb{R}^{d \times m}$,
which represent the information of the source network and the target network, respectively. Here, $n$
and $m$ are the numbers of users in the source and target networks, and $d$ is the final dimension
obtained by concatenating the four embeddings mentioned above.


$$\rho = \max \operatorname{corr}(h_i^{\top} X, m_j^{\top} Y) = \max \frac{h_i^{\top} C_{xy} m_j}{\sqrt{(h_i^{\top} C_{xx} h_i)(m_j^{\top} C_{yy} m_j)}} \qquad (5)$$


We use the canonical correlation analysis (CCA) technique to represent the two distinct spaces $X$ and
$Y$ in a common semantic space. CCA is a technique for learning linear correlations among multiple
multidimensional datasets. It finds a canonical latent space that maximizes the association between
the projections of these datasets onto that common space. RCCA defines the canonical matrices as
$A = [a_1; a_2; \ldots; a_k] \in \mathbb{R}^{d \times k}$ and
$B = [b_1; b_2; \ldots; b_k] \in \mathbb{R}^{d \times k}$, which comprise $k$ pairs of linear
projections. We use (5) to find the canonical matrices $A$ and $B$ that maximize the correlation
between the source and target networks. The anchor link problem is then resolved by projecting the
embeddings of the associated social networks $G^x$/$G^y$ with the canonical matrices $A$ and $B$ to get
the common correlated spaces $Z^x = A^{\top} X \in \mathbb{R}^{k \times n}$ and
$Z^y = B^{\top} Y \in \mathbb{R}^{k \times m}$.
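
A minimal NumPy sketch of regularized CCA is shown below. It assumes that the projections are fitted on columns of $X$ and $Y$ that correspond to known anchor pairs, and that regularization adds `reg` times the identity to each within-network covariance; the text does not spell out either detail. The names `rcca` and `inv_sqrt`, the synthetic data, and the value of `reg` are illustrative only.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def rcca(X, Y, k, reg):
    """X, Y: d x p matrices whose columns are paired (anchor) users."""
    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    p = X.shape[1]
    Cxx = Xc @ Xc.T / (p - 1) + reg * np.eye(X.shape[0])   # regularized covariance
    Cyy = Yc @ Yc.T / (p - 1) + reg * np.eye(Y.shape[0])
    Cxy = Xc @ Yc.T / (p - 1)                              # cross-covariance
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :k]          # canonical matrix for the source network
    B = Wy @ Vt.T[:, :k]       # canonical matrix for the target network
    return A, B, s[:k]         # s holds the regularized canonical correlations

rng = np.random.default_rng(0)
shared = rng.normal(size=(2, 200))       # latent signal common to both views
X = rng.normal(size=(5, 2)) @ shared + 0.1 * rng.normal(size=(5, 200))
Y = rng.normal(size=(5, 2)) @ shared + 0.1 * rng.normal(size=(5, 200))

A, B, rho = rcca(X, Y, k=2, reg=1e-3)
Zx, Zy = A.T @ X, B.T @ Y                # projections into the common space
```

Once `A` and `B` are fitted on the anchor pairs, the same matrices can project all users of both networks, including those without a known counterpart.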


3. EXPERIMENTS
3.1. Datasets

We evaluate our proposed model on two real-life alignment datasets [30]. Table 1 summarizes the
statistics of these datasets.

— Weibo vs Douban. This dataset was gathered from two well-known Chinese social networks, Weibo 
and Douban. It contains 1,397 anchor links. 

— DBLP17 vs DBLP19. This dataset was collected from the DBLP computer science bibliography website.
Two snapshots of the DBLP network were collected in two different periods of time, and we treat them
as one pair of networks to be aligned. The anchor links were constructed from the authors who have the
same unique key in the two snapshots; the dataset contains 2,832 anchor links.
Table 1. Statistics of real-life data sets
Network   #Nodes   #Edges    #Anchors
Weibo     9,714    117,218   1,397
Douban    9,526    120,245
DBLP17    9,086    51,700    2,832
DBLP19    9,325    47,7175


3.2. Evaluation metrics 

In our study, we used the Hit-precision metric [1] to assess the performance of the proposed method.
This metric measures how highly the true anchor link is ranked among the top-k candidates. For a single
user $x$, the score is computed using (6):

$$h(x) = \frac{k - (hit(x) - 1)}{k} \qquad (6)$$

In (6), $hit(x)$ is the position of the correctly predicted user in the top-k candidates of the output
collection. Supposing that $n$ is the number of assessed user pairs, the Hit-precision is the average
score over these pairs, as in (7):

$$\text{Hit-precision} = \frac{\sum_{i=1}^{n} h(x_i)}{n} \qquad (7)$$
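
As a worked example of the metric, the sketch below scores a handful of made-up ranks. It assumes a 1-based rank and a score of zero when the true match falls outside the top-k list, which is a common convention rather than something stated here; the function names are hypothetical.

```python
def hit_score(rank, k):
    """(k - (rank - 1)) / k for a 1-based rank; 0 if outside the top-k."""
    if rank is None or rank > k:
        return 0.0
    return (k - (rank - 1)) / k

def hit_precision(ranks, k):
    """Average per-user score over all assessed user pairs."""
    return sum(hit_score(r, k) for r in ranks) / len(ranks)

ranks = [1, 3, None, 5]   # None: the true match was not among the candidates
score = hit_precision(ranks, k=5)
```

With k = 5, the four users score 1.0, 0.6, 0.0, and 0.2, giving an average of 0.45.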


3.3. Baselines 
We compare the following anchor link prediction methods:
— MAUIL, a state-of-the-art method that contains three modules: an attribute-based embedding module,
a network-based embedding module, and an RCCA-based module.
— Our method, an improvement of MAUIL that adds the graph attention mechanism. It includes four
modules: an attribute-based embedding module, a network structure-based embedding module, a graph
attention embedding module, and an RCCA-based module.


3.4. Performance comparison 

We compare the performance of our proposed method with the baseline. In the experiment, the
hyperparameters of both the compared method and our method are tuned to perform best on the test
dataset. For our method, the output dimension of each embedding technique is set to the same value
$D = 100$, and the number of canonical components is empirically set to $k = 80$ for both the
Weibo-Douban and DBLP datasets. Correspondingly, the regularization parameter is set to $R = 1000$ for
both the Weibo-Douban and DBLP datasets.

Table 2 reports the anchor link prediction results for the DBLP17-DBLP19 and Weibo-Douban datasets.
From this table, we can see that our model consistently outperforms the baseline on both pairs of
datasets. We also tested our model with many different training ratios and evaluated the performance
at different hit-precision levels. For the DBLP dataset, Figure 2 shows the performance on
hit-precision@5 (Figure 2(a)), hit-precision@10 (Figure 2(b)), hit-precision@20 (Figure 2(c)), and
hit-precision@30 (Figure 2(d)). Similarly, for the Weibo-Douban dataset, Figure 3 shows the performance
on hit-precision@5 (Figure 3(a)), hit-precision@10 (Figure 3(b)), hit-precision@20 (Figure 3(c)), and
hit-precision@30 (Figure 3(d)). The experimental results show that our model gives consistently better
results for all training ratios. The accuracy of the model increases with the amount of training data
and almost converges when the training ratio reaches about 50%-60%.


Table 2. Comparison of Hit-precision with training ratio 30%

Metric              Weibo vs Douban          DBLP17 vs DBLP19
                    Our method   MAUIL       Our method   MAUIL
Hit-precision@1     0.2380       0.2310      0.7720       0.7510
Hit-precision@3     0.2913       0.2797      0.8060       0.7810
Hit-precision@5     0.3236       0.3099      0.8224       0.7970
Hit-precision@10    0.3774       0.3597      0.8412       0.8213
Hit-precision@15    0.4104       0.3920      0.8543       0.8324
Hit-precision@20    0.4326       0.4163      0.8636       0.8379
Hit-precision@25    0.4488       0.4361      0.8695       0.8413
Hit-precision@30    0.4615       0.4527      0.8747       0.8435

Figure 2. Precision@K comparison on DBLP dataset (a) precision@5, (b) precision@10, (c) precision@20,
and (d) precision@30

Figure 3. Precision@K comparison on Weibo-Douban dataset (a) precision@5, (b) precision@10,
(c) precision@20, and (d) precision@30

4. CONCLUSION

In this article, we study and apply multilevel embedding techniques to learn the representation of
nodes in online social networks. We project the learned embeddings onto the same latent space using
canonical correlation analysis and apply these representation learning models, along with other
techniques, to predict the formation of anchor links across information networks, a specific task in
information network analysis. The experiments on real-life datasets indicate that our method improves
Hit-precision over MAUIL at every reported cutoff (Table 2). The following is a summary of our
contributions in this article: i) we studied the theory related to network representation learning,
graph attention networks, and anchor link prediction; ii) we combined multilevel embedding techniques
for text-based attributes, the graph attention mechanism, and canonical correlation analysis for anchor
link prediction; and iii) we experimented with two real-life datasets. We also evaluated and compared
the experimental results with a related algorithm, and our model consistently outperforms the baseline.